Understanding Online Hate Speech as a Motivator and Predictor of Hate Crime, Los Angeles, California, 2017-2018 (ICPSR 37470)

Version Date: Jul 28, 2021

Principal Investigator(s):
Meagan Cahill, RAND Corporation; Katya Migacheva, RAND Corporation; Jirka Taylor, RAND Corporation; Matthew Williams, Cardiff University; Pete Burnap, Cardiff University; Amir Javed, Cardiff University; Han Liu, Cardiff University; Hui Lu, RAND Europe; Alex Sutherland, RAND Europe

https://doi.org/10.3886/ICPSR37470.v1

Version V1

In the United States, a number of challenges prevent an accurate assessment of the prevalence of hate crimes in different areas of the country. These challenges create large gaps in knowledge about hate crime--who is targeted, how, and in what areas--which in turn hinder appropriate policy efforts and the allocation of resources to hate crime prevention. In the absence of high-quality hate crime data, online platforms may provide information that can contribute to a more accurate estimate of the risk of hate crimes in certain places and against certain groups of people. Data on social media posts that use hate speech, or on internet search terms related to hate against specific groups, have the potential to enhance and facilitate a timely understanding of what is happening offline, outside of traditional monitoring (e.g., police crime reports). This study assessed the utility of Twitter data to illuminate the prevalence of hate crimes in the United States with the goals of (i) addressing the lack of reliable knowledge about hate crime prevalence in the U.S. by (ii) identifying and analyzing online hate speech and (iii) examining the links between online hate speech and offline hate crimes.

The project drew on four types of data: recorded hate crime data, social media data, census data, and data on hate crime risk factors. An ecological framework and Poisson regression models were adopted to study the explicit link between hate speech online and hate crimes offline. Risk terrain modeling (RTM) was used to further assess the ability to identify places at higher risk of hate crimes offline.
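As a rough illustration of the Poisson regression step, the sketch below fits a log-linear model of tract-level hate crime counts on tract-level hateful tweet counts. It is a minimal example under stated assumptions, not the investigators' specification: the function name fit_poisson, the single predictor, and the toy numbers are all hypothetical, and the study's actual models also included census controls that are omitted here.

```python
import math

def fit_poisson(x, y, iterations=50, tol=1e-10):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson on the Poisson log-likelihood."""
    mean_y = sum(y) / len(y)
    b0, b1 = math.log(mean_y), 0.0  # start from the intercept-only fit
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2x2 Fisher information matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve information * step = score for the 2x2 system.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical tract-level data (invented, not from the study):
# x = hateful tweet count per tract, y = recorded hate crime count per tract.
hateful_tweets = [0, 1, 2, 3]
hate_crimes = [1, 2, 4, 8]

b0, b1 = fit_poisson(hateful_tweets, hate_crimes)
rate_ratio = math.exp(b1)  # multiplicative change in expected crimes per extra hateful tweet
print(f"intercept={b0:.4f} slope={b1:.4f} rate ratio={rate_ratio:.3f}")
```

The toy counts were constructed so that each additional hateful tweet doubles the expected crime count, which means the fitted rate ratio comes out at about 2. With real tract data, a statistical package that also reports standard errors and handles covariates would be the more natural choice.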

Cahill, Meagan, Migacheva, Katya, Taylor, Jirka, Williams, Matthew, Burnap, Pete, Javed, Amir, … Sutherland, Alex. Understanding Online Hate Speech as a Motivator and Predictor of Hate Crime, Los Angeles, California, 2017-2018. Inter-university Consortium for Political and Social Research [distributor], 2021-07-28. https://doi.org/10.3886/ICPSR37470.v1

United States Department of Justice. Office of Justice Programs. National Institute of Justice (2016MUMU0009)

Census tract

This data collection may not be used for any purpose other than statistical reporting and analysis. Use of these data to learn the identity of any person or establishment is prohibited. To protect respondent privacy, the data files in this collection are restricted from general dissemination. To obtain these restricted files, researchers must agree to the terms and conditions of a Restricted Data Use Agreement.

Inter-university Consortium for Political and Social Research

2017-09-01 -- 2018-09-30
2017-09-01 -- 2019-07-31
  1. For additional information on the Understanding Online Hate Speech as a Motivator and Predictor of Hate Crime Study, please visit the Understanding Online Hate Speech as a Motivator and Predictor of Hate Crime website.

The overarching goals of the research were to (i) address the lack of reliable knowledge about hate crime prevalence in the U.S. by (ii) identifying and analyzing online hate speech and (iii) examining the links between the online hate speech and offline hate crimes. To achieve these goals, the project pursued the following three objectives:

  1. Classify online hate speech in terms of (i) which individuals and groups direct what kinds of speech (type and severity) at (ii) which groups and (iii) where the tweets are generated.
  2. Estimate the relationship between online hate speech classification and offline hate crime.
  3. Develop and test an empirical model to identify areas at increased risk of hate crimes.

The project drew on four types of data: recorded hate crime data, social media data, census data, and data on hate crime risk factors.

Recorded hate crime data served as the dependent measure in the analyses. Data were obtained on hate crimes recorded in 2017 and 2018 in L.A. County, compiled by the Los Angeles County Commission on Human Relations (LACCHR). These data represent the most comprehensive data set on hate crimes available in the county. The LACCHR receives hate crime incident reports from 46 law enforcement agencies, 5 community organizations, 36 school districts, and 13 higher education institutions, as well as directly from victims. LACCHR staff review the data from all sources to determine whether each reported incident meets the definition of a hate crime under applicable statutes. Staff also check for duplicate reports to ensure incidents are not double-counted. For incidents that occurred in public places, the investigators received the actual location of the incident; for those occurring in private locations, the investigators received mid-block location information. Data from LACCHR were coded into three categories for analysis: (i) racially motivated hate crimes; (ii) religion-motivated hate crimes; and (iii) sexual orientation-motivated hate crimes. For the purposes of the ecological analysis, the data were then aggregated to census tracts, yielding count data for each measure by census tract and year.
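The aggregation described above, from individual incidents to counts by census tract, year, and motivation category, can be sketched as follows. This is only an illustration: the record layout, the category labels, and the tract identifiers are hypothetical and are not drawn from the LACCHR data.

```python
from collections import Counter

# Hypothetical incident records: (census tract ID, year, motivation category).
# The tract IDs below are invented for illustration only.
incidents = [
    ("06037101110", 2017, "race"),
    ("06037101110", 2017, "race"),
    ("06037101110", 2018, "religion"),
    ("06037206300", 2018, "sexual_orientation"),
    ("06037206300", 2018, "race"),
]

def aggregate_counts(records):
    """Count incidents in each (tract, year, category) cell."""
    return Counter(records)

counts = aggregate_counts(incidents)
for (tract, year, category), n in sorted(counts.items()):
    print(tract, year, category, n)
```

A Counter returns 0 for cells with no incidents, which is convenient for count models where tracts without recorded hate crimes still need to appear in the analysis.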

Social media data were the main independent measure of interest. Using the Twitter streaming Application Programming Interface (API) via COSMOS software (Burnap et al., 2014), all tweets posted between September 2017 and September 2018 and geotagged to L.A. County were collected. These data were used to derive a count of all geocoded tweets; 1,813,862 tweets were geolocated within L.A. County in 2017 and 2018.

Supervised machine learning classifiers were then built to identify hateful tweets targeting three characteristics: race (anti-African-American), religion (anti-Muslim, anti-Jewish), and sexual orientation (anti-lesbian, gay, and bisexual). Recorded hate crimes in L.A. County are most frequently reported to target one of these three characteristics. Three gold-standard datasets of human-coded annotations were generated from samples of tweets to train the machine classifiers (see Appendix B for classifier results). The classifiers were then used to identify all hateful tweets in the dataset, including which characteristics each tweet targeted. Finally, all geolocated tweets were aggregated to census tracts, providing counts of all tweets and hateful tweets by tract. An important caveat to both the social media and the hate crime data is that neither is a representative sample of the true population: only tweets from users opting to have their tweets geotagged and offline were captured, and only data on reported hate crimes were available.
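The final aggregation step, from classified tweets to tract-level counts of all tweets and hateful tweets, might look like the sketch below. It assumes the classifier output has already been reduced to a census tract and a predicted target per tweet; that layout, the labels, and the tract identifiers are hypothetical, and the classifiers themselves are not reproduced here.

```python
from collections import defaultdict

# Hypothetical classifier output: (census tract ID, predicted target or None).
# None marks a tweet that no classifier flagged as hateful.
classified_tweets = [
    ("06037101110", None),
    ("06037101110", "race"),
    ("06037101110", None),
    ("06037206300", "religion"),
    ("06037206300", "sexual_orientation"),
    ("06037206300", None),
    ("06037206300", None),
]

def tract_tweet_counts(tweets):
    """Return {tract: {"all": n, "hateful": n, <target>: n, ...}}."""
    out = defaultdict(lambda: defaultdict(int))
    for tract, target in tweets:
        out[tract]["all"] += 1
        if target is not None:
            out[tract]["hateful"] += 1
            out[tract][target] += 1
    return {tract: dict(c) for tract, c in out.items()}

tweet_counts = tract_tweet_counts(classified_tweets)
for tract, c in sorted(tweet_counts.items()):
    print(tract, c)
```

Keeping the count of all geolocated tweets alongside the hateful count lets an analysis account for overall Twitter activity in a tract rather than treating raw hateful tweet counts in isolation.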

Census data. The latest 5-year estimates from the American Community Survey were also collected for use as controls in analytic models. Relevant variables were selected based on literature that estimated hate crime using ecological factors (e.g. Green, 1998; Espiritu, 2004). These include age, employment status, race and educational attainment.

Hate crime risk factor data. Existing research literature was reviewed to identify particular environmental features that served as risk factors in risk-terrain models (see Table 2 for the full list of 20 variables). Data on these factors were obtained from public sources including the public L.A. County GIS portal.

Tweets over the course of one year from the general population of Los Angeles, California.

Event/Process, Text Unit

Variables include counts of the number of tweets with various types of hate speech, counts of hate crimes broken down by category, and variables on population with breakdowns by race, gender, age, and educational achievement.

2021-07-28

2021-07-28 ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection:

  • Performed consistency checks.
  • Checked for undocumented or out-of-range codes.

Notes

  • The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.

  • One or more files in this data collection have special restrictions. Restricted data files are not available for direct download from the website; click on the Restricted Data button to learn more.

This dataset is maintained and distributed by the National Archive of Criminal Justice Data (NACJD), the criminal justice archive within ICPSR. NACJD is primarily sponsored by three agencies within the U.S. Department of Justice: the Bureau of Justice Statistics, the National Institute of Justice, and the Office of Juvenile Justice and Delinquency Prevention.